Flower_Power3.png

The Data Preparation phase of IBM's CRISP-DM model is about preparing the final dataset to be used in the later stages. This phase covers table, record, and attribute selection, as well as data transformation and cleaning. In practice, these tasks are likely to be performed multiple times.

1. Libraries and Data Imports

2. Excel File Processing

2.1. Unique flower names

In the previous two steps, we noticed that some of the unique names are repeated because of whitespace at the end of a few cells; an example can be observed in the first row. Thus, we had to remove the trailing whitespace from each cell. The code for this is shown in the next line.
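As a minimal sketch of this step (not the original notebook cell), the trailing whitespace can be removed with the pandas string accessor. The column name `english_name` and the sample values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data; "english_name" is an assumed column name.
df = pd.DataFrame({"english_name": ["Daisy ", "Daisy", "Tulip  ", "Tulip"]})

# Before cleaning, "Daisy " and "Daisy" count as two different names.
names_before = df["english_name"].nunique()

# Remove whitespace at the end of every cell in the column.
df["english_name"] = df["english_name"].str.rstrip()

unique_names = sorted(df["english_name"].unique())
names_after = len(unique_names)
```

Using `str.strip()` instead would also remove leading whitespace, which may be preferable if the source file has both problems.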

2.2. Fixing typing mistakes

During the EDA, we plotted all unit types and noticed two typing mistakes. Therefore, we replace each misspelled value, which occurs only once, with the correctly spelled unit that occurs multiple times ("spike" becomes "Spike"; "Umrel" becomes "Umbrel").
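One way to apply these two corrections is a replacement dictionary passed to `Series.replace`. This is a sketch under the assumption that the column is called `unit`; the sample rows are invented.

```python
import pandas as pd

# Hypothetical sample data; "unit" is an assumed column name.
df = pd.DataFrame({"unit": ["Spike", "spike", "Umbrel", "Umrel", "Spike"]})

# Map each misspelled value to the spelling used by the other rows.
unit_corrections = {"spike": "Spike", "Umrel": "Umbrel"}
df["unit"] = df["unit"].replace(unit_corrections)

remaining_units = sorted(df["unit"].unique())
```

Only cells that exactly match a dictionary key are changed, so correctly spelled values are left untouched.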

2.3. Splitting the location

Since the column "location" contains the city along with extra information about where the photo was taken, such as the district, we have split its values into two columns: "location" (the city) and "district".
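The split could look roughly like the sketch below. The separator is not stated in the text, so a comma is assumed here, and the place names are placeholders rather than values from the dataset.

```python
import pandas as pd

# Hypothetical sample data; the comma separator is an assumption.
df = pd.DataFrame({"location": ["CityA, North", "CityB", "CityA, Centre"]})

# Split on the first comma only: part 0 is the city, part 1 the rest.
parts = df["location"].str.split(",", n=1, expand=True)

df["location"] = parts[0].str.strip()
df["district"] = parts[1].str.strip()  # stays missing when no district is given
```

Rows without a separator keep their city and receive a missing value in "district". Note that `parts[1]` only exists if at least one row contains the separator.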

2.4. Filling missing values for date

Since the date is filled in only for the first row of each visit, the following code fills in the NaN values of the rows with missing dates. The function used is ‘ffill’, which stands for ‘forward fill’ and replaces each missing value with the last valid value from the preceding rows.
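A small illustration of forward filling, assuming the column is named `date`; the dates and the `photo_id` values are made up for the example.

```python
import pandas as pd

# Hypothetical sample: the date appears only on the first row of each visit.
df = pd.DataFrame({
    "date": ["2021-05-01", None, None, "2021-05-08", None],
    "photo_id": [1, 2, 3, 4, 5],
})

# Propagate the last valid date downwards into the missing cells.
df["date"] = df["date"].ffill()

filled_dates = df["date"].tolist()
```

Forward filling relies on the rows still being in their original order, so it should run before any sorting or shuffling of the table.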

2.5. Creating a "month" column
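A possible way to derive the month from the filled date column is sketched below. The text does not say whether the month is stored as a number or a name, so the numeric form is an assumption, as are the column name `date` and the sample dates.

```python
import pandas as pd

# Hypothetical sample data with the dates already forward filled.
df = pd.DataFrame({"date": ["2021-05-01", "2021-05-08", "2021-06-12"]})

# Parse the strings into datetimes, then extract the month number (1-12).
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month

months = df["month"].tolist()
```

If month names are preferred, `df["date"].dt.month_name()` returns them instead of numbers.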

3. Labeling of the images

The code below labels the images according to the English name of the flower. It includes a function that concatenates the name of the flower with the name of the image. Pictures that are not present in the Excel file are stored in a separate folder called "Extra images". In addition, a CSV file is created that lists the duplicated images in the dataset (two photos having the same photo_id). Screenshots of the end result are shown below.
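The labeling logic can be sketched with the standard library alone. This is an illustration, not the original code: the function name `label_images`, the output file name `duplicated_images.csv`, the per-flower folder layout, and the assumption that each image file name (without extension) equals its photo_id are all hypothetical.

```python
import csv
import shutil
from collections import Counter
from pathlib import Path


def label_images(records, image_dir, output_dir):
    """Copy images into per-flower folders, prefixing the flower name.

    records: (photo_id, english_name) pairs taken from the Excel file.
    Returns the photo_ids that occur more than once in records.
    Assumes output_dir is not located inside image_dir.
    """
    image_dir, output_dir = Path(image_dir), Path(output_dir)

    name_by_id = {}
    for photo_id, name in records:
        name_by_id.setdefault(photo_id, name)  # first occurrence wins

    counts = Counter(photo_id for photo_id, _ in records)
    duplicates = sorted(pid for pid, n in counts.items() if n > 1)

    for image in sorted(image_dir.iterdir()):
        if not image.is_file():
            continue
        name = name_by_id.get(image.stem)
        if name is None:
            # Image is not listed in the Excel file.
            target = output_dir / "Extra images" / image.name
        else:
            # Concatenate the flower name with the image name.
            target = output_dir / name / f"{name}_{image.name}"
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image, target)

    # Record the duplicated photo_ids for later inspection.
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "duplicated_images.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["photo_id"])
        writer.writerows([pid] for pid in duplicates)
    return duplicates
```

The sketch copies rather than moves the files, so the original image folder stays intact and the step can be rerun safely.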

extra_images.png

labeled_images.jpg

4. Conclusion

Overall, the Excel file needed some pre-processing due to multiple typos. Moreover, there was stray whitespace in some of the column names as well as in some of the rows. The Excel file was also not well structured: around the table, there were various headings and paragraphs that are irrelevant for our AI model. Furthermore, the way in which the images were stored was somewhat confusing and impractical for people who are not familiar with this field.

With the Data Preparation phase completed, we have a structured environment with one folder of images per flower name and a standardized Excel file.